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ABSTRACT 

Libpsht (or "library for performant spherical harmonic transforms") is a collection of algorithms for efficient conversion between 
spatial-domain and spectral-domain representations of data defined on the sphere. The package supports both transforms of scalars 
and spin-1 and spin-2 quantities, and can be used for a wide range of pixelisations (including HEALPix, GLESP, and ECP). It will take 
advantage of hardware features such as multiple processor cores and floating-point vector operations, if available. Even without this 
additional acceleration, the employed algorithms are among the most efficient (in terms of CPU time, as well as memory consumption) 
currently being used in the astronomical community. 

The library is written in strictly standard-conforming C90, ensuring portability to many different hard- and software platforms, and 
allowing straightforward integration with codes written in various programming languages like C, C++, Fortran, Python etc. 
Libpsht is distributed under the terms of the GNU General Public License (GPL) version 2 and can be downloaded from http : 
// sourcef orge . net/pro j ects/libpsht/ 

Key words, methods: numerical - cosmic background radiation - large-scale structure of the Universe 



1. Introduction 

Spherical harmonic transforms (SHTs) have a wide range of ap- 
plications in science and engineering. In the case of CMB sci- 
ence, which prompted the development of the library presented 
in this paper, they are an essential building block for extracting 
the cosmological power spectrum from full-sky maps, and are 
also used for generating synthesised sky maps in the context of 
Monte Carlo simulations. For all recent experiments in this field, 
the extraction of spherical harmonic coefficients has to be done 
up to very high multipole moments (up to 10 4 ), which makes 
SHTs a fairly costly operation and therefore a natural candidate 
for optimisation. It also gives rise to a number of numerical com- 
plications that must be addressed by the transform algorithm. 

The purpose of SHTs is the conversion between functions (or 
maps) on the sphere and their representation as a set of spheri- 
cal harmonic coefficients in the spectral domain. In CMB sci- 
ence, such transforms are most often needed for quantities of 
spin (i.e. quantities which are invariant with respect to rotation 
of the local coordinate system) and spin 2, because the unpo- 
larised and polarised components of the microwave radiation are 
fields of these respective types. For applications concerned with 
gravitational lensing, SHTs of spins 1 and 3 are also sometimes 
required. 

The paper is organised as follows. The following section 
lists the underlying equations anf the motivations for developing 
libpsht, as well as the goals of the implementation. A detailed 
explanation of the library's inner workings is presented in Sect. 
[3] and Sect. [4] contains a high-level overview of the interface 
it provides. Detailed studies regarding libpsht's performance, 
accuracy, and other quality indicators were performed, and their 
results presented in Sect. [5] Finally, a summary of the achieved 
(and not completely achieved) goals is given, together with an 
outlook on planned future extensions. 

Send offprint requests to: M. Reinecke, 
e-mail: martin@mpa-garching . mpg . de 



2. Code capabilities 

2.1. SHT equations 

A continuous spin- s function f(§, <p) with a spectral band limit 
of Imax is related to its corresponding set of spin spherical har- 
monic coefficients ,«/,„ by the following equations: 

'max 'max 

/(#,</>) = ^ ^ s a lm s Yi m (p, (p) (1) 

m--/ max /-|m| 

X7T r*1ix 
d<pd&sm&m<p) s Y^J&,<p). (2) 
=o J^=o 

Eqs. ([TJ and |2]i are known as backward (or synthesis) and 
forward (or analysis) transforms, respectively. For a discretised 
spherical map consisting of a vector p of Af p j x pixels at loca- 
tions (§, <p) and with (potentially weighted) solid angles w, they 
change to: 

'max 'max 

Pn = X X sQlm s Y 'm(&n,<P„) (3) 

m=- ^max /=|m| 

A m = ^p n w ns Y* lm {§ n ,<p n ). (4) 

n=l 

Depending on the choice of Z max , w, and the grid geometry, the 
p — > s ai,„ transform may only be approximate, which is indicated 
by choosing the identifier a instead of a. 

The main purpose of the presented code is the efficient im- 
plementation (regarding CPU time as well as memory consump- 
tion) of Eqs. ([3]l and Q for scalars as well as tensor quantities 
of spins +1 and ±2. 
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2.2. Design considerations 

At present, several SHT implementations are in use within 
the CMB community, like those distribut ed with the Fortran 
and C++ implementations of HEALP i?^ ( Gors ki et al. 2005 K 
GLESlfl dDoroshkevich et al.l |2005| >, s2hat0 spinsfasfF 
(Huff enberger & Wandeltj |2010) , ccSHTp] and IGLOC£ 
(Crittenden & Turok| |1998[ ); they ofFer varying degrees of per- 
formance, flexibility and feature completeness. The main design 
goal for libpsht was to provide a code which performs better 
than the SHTs in these packages and is versatile enough to serve 
as a drop-in replacement for them. 

This immediately leads to several technical and pragmatic 
requirements which the code must fulfill: 

Efficiency: The transforms provided by the library must perform 
at least as well (in terms of CPU time and memory) as those 
in the available packages, on the same hardware. If possi- 
ble, all available hardware features (like multiple CPU cores, 
vector arithmetic instructions) should be used. 

Flexibility: The library needs to support all of the above dis- 
cretisations of the sphere; in addition, it should allow trans- 
forms of partial maps. 

Portability: It must run on all platforms supported by the soft- 
ware packages above. 

Universality: It must provide a clear interface that can be con- 
veniently called from all the programming languages used 
for the packages above. 

Completeness: If feasible, it should not depend on external li- 
braries, because these complicate installation and handling 
by users. 

Compactness: In order to simplify code development, mainte- 
nance and deployment, the library source code should be 
kept as small as possible without sacrificing readability. 

It should be noted that the code was developed as a library for 
use within other scientific applications, not as a scientific appli- 
cation by itself. As such it strives to provide a comprehensive 
interface for programmers, rather than ready-made facilities for 
users. 



2.3. Supported discretisations 



For completely general discretisations of the sphere, the oper- 
ation count for SHTs is <9(Z max Af p ; x ) in both directions, which 
is not really affordable for today's resolution requirements. 
Fortunately, approaches of lower complexity exist for grids with 
the following geometrical constraints: 



jVpi x iso-latitude 



- the map pixels are arranged on N# 5 
rings, where the colatitude of each ring is denoted as ft y 

- within one ring, pixels are equidistant in tp and have identical 
weight w y ; the number of pixels in each ring can vary and is 
denoted as N^ y , and the azimuth of the first pixel in the ring 
is <poj. 



http://healpix.jpl.nasa.gov 

2 http://glesp.nbi .dk 

3 http : //www . ape . uni v- paris7 . fr/~radek/ s2hat . html 



4 http: //www. physics. miami . edu/~huffenbe/research/ 
|spinsfast| 



http://crd.lbl .gov/~cmc/ccSHTlib/doc 
http : //www . ci ta . utoronto . ca/~ crittend/pixel . html 



Under these assumptions, Eqs. <|3j and Q can be written as 

Pxy = X s<llm s ^ lm ^y> eX P \ im f0,y + J (5) 

m=-Ui=M V v ' y ' 

Afe-i# w -i / 2nimx\ 

Am = ^ ^ p^WysAimi&ylexp l-imipoj — - — , (6) 

y=0 x=0 ' ' 

where s Ai, n {ff) :- s Yi m (-&, 0). By subdividing both transforms into 
two stages 



Pxy = 
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m--/ r 



2nimx \ 
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^ G m , y s Ai m (§ y ) with 
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5=0 
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G m ,y := Wy ^ 



Wy 7 , Pxy exp \-im<p , 



2mmx \ 



N, 



<p,y I 



(7) 
(8) 
(9) 
(10) 



it becomes obvious that the steps s a; m — > F m y and G m y — » s a; m 
require 0(N§l^ n!ix ) operations, while a set of fast Fourier trans- 
forms - with the subdominant complexity of 0(N§l msix log l m . dx ) - 
can be used for the steps F m>y — > p xv and p KV — > G mjV , respec- 
tively. For any "useful" arrangement of the rings on the sphere, 
N§ will be roughly proportional to / max , resulting in <9(/^ ax ) op- 
erations per SHT. 

Practically all spherical grids which are currently used in 
the CMB field (HEALPix, ECP, GLESP, IGLOO) fall into this 
category, so libpsht can take advantage of this simplification. 
A notable exception is the "quadrilateral spherical cube" grid 
used by the COBE mission, but this is not really an issue, since 
COBE's resolution is so low that nowadays even brute-force 
SHTs produce results in acceptable time. 

For the special case of the ECP grid, which is equidis- 
tant in both # and <p, algorithms of the even lowe r complex- 



Driscoll & 



ity <5(/^ ax log 2 / m ax) have been presented (see, e.g., 
Healy|1994 Wiaux et al.|2007[ ). These were not considered for 
libpsht, because this would conflict with the flexibility goal. 
In addition, it has been observed that the constant prefactor for 
these methods is fairly high, so that they do not offer any perfor- 
mance advantage for small and intermediate Z max ; for transforms 
with high /max, on the other hand, their memory consumption is 
prohibitively high. 

2.4. Transformable quantities 

Without limiting the general applicability of the provided SHTs, 
libpsht imposes a few constraints on the format of the data 
it processes. For scalar transforms, it will only work with real- 
valued maps (the ubiquitous scenario in CMB physics); this is 
not really a limitation, since transforming a complex-valued map 
can be achieved by simply transforming its real and imaginary 
parts independently and computing a linear combination of the 
results. 

For SHTs with s ^ 0, libpsht works on the so-called gra- 
dient (E) and curl (B) components of the tensor field 



\s\El,„ :- (-1) 2^\ a lm + ( - l) S -|s|«ta) 
i\s\Bim := {-l) H \{\ s \ai m - (-\) S -\ s \aim) 



(ID 

(12) 
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where H E* lm = (-\) yE,,- m , M B! = (-l) m N fi,,_ m and (-l) H 
is a sign convention ( |Lewis|2 005 ). The default setting for H in 
libpsht is 1, following the HEALPix convention, but this can 
be changed easily if necessary. On the real-space side, the tensor 
field is represented as two separate maps, containing the real and 
imaginary part, respectively. 

In the following discussion, no fundamental distinction ex- 
ists between \ s \Ei m and \ s \Bi m , so for the sake of brevity both will 
be identified as s a\ m . 

This choice of representation was made because it is compat- 
ible with the data formats used by most existing packages, which 
simplifies interfacing and data exchange. It also has the impor- 
tant advantage that any set of spherical harmonic coefficients is 
completely specified by its elements with m >- 0, since s at„, and 
s ai- m are coupled by symmetry relations. 



3. Algorithm description 

3.1. Calculation of the spherical harmonic coefficients 

The central building block for every SHT implementation is 
the quick and accurate computation of spherical harmonics 
s Yi m {§,ip); for libpsht, these need only be evaluated at <p = 
and will be abbreviated as s Ai m (§). 

3.1.1. Scalar case 

For s = 0, a proven method of determining the coefficients is 
using the recursion relation 



hm - COS#A/ m /l/_i, m - Bi m Ai-2j 

with 



4l 2 -l A, m 

A lm-= A/rt i and B lm - 



J2 _ m 2 



At-] 



(13) 



(14) 



in the direction of increasing /; the starting values A mm can be ob- 
tained from Aqq = {AnY 112 and the additional recursion relation 



= —sin& 



2m + 1 
2m 



Am— 1 _m— 1 



(15) 



( Varshalovich et al.|1988 1. Since two values are required to start 
the recursion in /, one also uses A m+ ] m = A mm V2w + 3. 

Both recursions are numerically stable; nevertheless, their 
evaluation is not trivial, since the starting values A mm can become 
extremely small for increasing m and decreasing sin i?, such that 
they can no longer be represented by the IEEE754 number for- 
mat used by virtually all of today's computers. This problem was 
addressed using the approach described by Preze au & Reinecke] 
(2010), i.e. by augmenting the IEEE representation with an ad- 
ditional integer scale factor, which serves as an "enhanced ex- 
ponent" and dramatically increases the dynamic range of repre- 
sentable numbers. 

As is to be expected, this extended range comes at a 
somewhat higher computational cost; therefore the algorithm 
switches to standard IEEE numbers as soon as the A/,„ have 
reached a region which can be safely represented without a scale 
factor. From that point on, the recursion in / only requires three 
multiplications and one subtraction per A value, which are very 
cheap instructions on current microprocessors. 

From the above it follows that, depending on the exact pa- 
rameters of the SHT, a non-negligible fraction of the s /l/ m (i?) is 
extremely small, and that all terms in Eqs. ([8]) and (|9| contain- 
ing such values as a factor do not contribute in a measurable 



way to the SHT's result, due to the limited dynamic range of the 
floating-point format in which the result is produced. For that 
reason, libpsht records the index / t h, at which the \ s Ai m (§)\ first 
exceed a threshold e t h (which is set to the conservative value of 
10 -30 ) during every I recursion. The remaining operations will 
then be carried out for / > l± only. 

Another acceleration is achieved by skipping the /-recursion 
for an (m, #)-tuple entirely, if another recursion was performed 
before with m c < m and | cos § c \ < | cos §\ < 1 that never crossed 
£th. 

In the case that a ring at colatitude § has a counterpart at 
7T--& (which is almost always the case for the popular grids), the 
s Ai m {ff) recursion need only be performed for one of those; the 
values for the other ring are quickly obtained by the symmetry 
relation 



s A lm {n - §) = {-\t s _ s A,„m 



(16) 



Libpsht will identify such ring pairs automatically and treat 
them in an optimised fashion. 

3.1.2. Tensor case 

Following Lewis (2005), libpsht does not literally compute the 
tensor harmonics ±s Ai m , but rather their linear combinations 



(17) 



These are not directly obtained using a recursion, but rather by 
first computing the spin-0 A\ m and subsequently applying spin 
raising and lowering operators; see appendix A of |Lewis| (2005) 
for the relevant formulae. Whenever scalar and tensor SHTs are 
performed simultaneously, this has the advantage that the scalar 
Aim can be re-used for both; in CMB science this scenario is very 
common, since any transform involving a polarised sky map re- 
quires SHTs for both s — and s = ±2. 

The approach can be extended to higher spins in principle, 
but for \s\ > 2 the expressions become quite involved and are 
also slow to compute. For this reason, libpsht's transforms are 
currently limited to s = 0, + 1,±2. Should a need for SHTs at 
higher s become evident, a method making use of the relation- 
ships between s A/ m and the Wigner d' mm , matrix elements seems 
most promising, since they obey very similar recursion relations 
(see, e.g., Kostelec & Rockmore 2008), for which a recent im- 
plementation is available ( |Prezeau & Reinecke|2010| l . 

3.1 .3. Re-using the computed s Ai, n (§) 

Many packages allow the user to supply an array containing pre- 
computed s Ai„,{§) coefficients to the SHT routines, which can 
speed up the computation by a factor of up to w 2. Libpsht does 
not provide such an option; this may seem surprising at first, but 
it was left out deliberately for the following reasons: 

- Even if supplied externally, the coefficients have to be ob- 
tained by some method. Reading them from mass storage is 
definitely not a good choice, since they can be computed on- 
the-fly much more quickly on current hardware. Externally 
provided values can therefore only provide a benefit if they 
are used in at least two consecutive SHTs. 

- When using multiple threads in combination with precom- 
puted s Ai,„(§), computation time soon becomes dominated 
by memory access operations to the table of coefficients, be- 
cause the memory bandwidth of a computer does not scale 
with the number of processor cores. This will severely limit 
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the scalability of the code. It is safe to assume that this prob- 
lem will become even more pronounced in the future, since 
CPU power tends to grow more rapidly than memory band- 
width prepper|2007l l. 

- For problem sizes with Z max ^ 2000, when additional per- 
formance would be most desirable, the precomputed tables 
(whose size is proportional to N^t^) become so large that 
they no longer fit into main memory. A conservative estimate 
for the size of the s Ai„,(§) array at Z max = 2000 and s = is 
about 8GB; the Fortran HEALPix library currently requires 
32GB in this situation. 

- Libpsht supports multiple simultaneous SHTs, and will 
compute the s A/ m {&) only once during such an operation, re- 
using them for all individual transforms. Since this is done 
in a piece-by -piece fashion (see Sect. 3.3 1, only small sub- 
sets of size 0(l mdx ) need to be stored, which is more or less 
negligible; this approach also implies that the internally pre- 
computed s Ai m {§) subsets will most likely remain in the CPU 
cache and therefore not be subject to the memory bandwidth 
limit. 



3.2. Fast Fourier transform 

Libpsht uses an adapted version of the well-known fftpack 
library ( |Swarztrauber|| 1982| , ported to the C language (with a 
few minor adjustments) by the author. This package performs 
well for FFTs of lengths N whose prime decomposition only 
contains the factors 2, 3, and 5. For prime N, its complexity de- 
grades to 0(N 2 ), which should be avoided, so for N contain- 
ing large prime factors, Bluestein's algorithm (Bluestein 1968) 
is employed, which allows replacing an FFT of length N by a 
convolution (conveniently implemented as a multiplication in 
Fourier space) of any length > 2N - 1. By appropriate choice 
of the new length, the desired 0(N log N) complexity can be 
reestablished, at the cost of a higher prefactor. 

Obviously it would have bee n possible to use the ve ry pop- 
ular and efficient FFTiQ library ( Frigo & Johnson|2005 1 for the 
Fourier transforms, but since this package is fairly heavyweight, 
this would conflict with libpsht's completeness or compact- 
ness goals. Even though libff tpack with the Bluestein modifi- 
cation will be somewhat slower than FFTW, the efficiency goal is 
not really compromised: libpsht's algorithms have an overall 
complexity of 0{N,flf wx ), so that the FFT part with its total com- 
plexity of 0(N,fl m . dx log / max ) will (for nontrivial problem sizes) 
never dominate the total CPU time. 



3.3. Loop structure of the SHT 

As Eqs. (|5jl and |6j and the complexity of (9(Af#^ ax ) already 
suggest, a code implementing SHTs on an iso-latitude grid will 
contain at least three nested loops which iterate on y, m, and /, re- 
spectively. The overall speed and/or memory consumption of the 
final implementation depend sensitively on the order in which 
these loops are nested, since different hierarchies are not equally 
well suited for precomputation of common quantities used in the 
innermost loop. 

The choice made for the computation of the s Ai m (ff), i.e. to 
obtain them by /-recursion (see Sect. 3. 1 1 strongly favors placing 



the loop over I innermost; so only the order of the y and m loops 
remains to be determined. 

From the point of view of speed optimisation, arguably the 
best approach would be to iterate over m in the outermost loop, 



then over the ring number y, and finally over /. This allows pre- 
computing the A\ m and B\ m quantities (see Sect. 3.1 1 exactly once 
and in small chunks, for an additional memory cost of only 
0(l mlLX ). However, this method requires additional 0(N§l m:ix ) 
storage for the intermediate products F mj) or G m , y , which is com- 
parable to the size of the input and output data. This was con- 
sidered too wasteful, since it prohibits performing SHTs whose 
input and output data already occupy most of the available com- 
puter memory. 

Unfortunately, switching the order of the two outermost 
loops does not improve the situation: now, one is presented with 
the choice of either precomputing the full two-dimensional A\ m 
and Bi„, arrays, which is not acceptable for the same reason that 
prohibits storing the entire F mj , and G„ uy arrays, or recomputing 
their values over and over, which consumes too much CPU time. 

To solve this dilemma, it was decided not to perform the 
SHTs for a whole map in one go, but rather to subdivide them 
into a number of partial transforms, each of which only deals 
with a subset of rings (or ring pairs). An appropriate choice for 
the number of parts will result in near-optimal performance com- 
bined with only small memory overhead; the most symmetri- 



http://fftw.org 



cal choice of Nj leads to a time overhead of 0{Nj Z^ ax ) and 

additional memory consumption of 0(N^ 2 l max ), both of which 
are small compared to the total CPU and storage requirements. 
Libpsht uses sub-maps of roughly 100 or N#/10 rings (or ring 
pairs), whichever is larger. 

The arrangement described above is sufficient for a single 
SHT. Since libpsht supports multiple simultaneous SHTs, an- 
other loop hierarchy had to be introduced. A pseudo-code listing 
illustrating the complete design of the SHT algorithm (includ- 
ing the subdominant parts performing the FFTs) is presented in 
Fig.0 



3.4. Further acceleration techniques 

To make use of multiple CPU cores, if available, the algorithm's 
loop over m, as well as the code blocks performing the FFTs, 
have been parallelised using OpenMP^jdirectives, requesting dy- 
namic scheduling for every iteration; this is indicated by the 
"OpenMP" tag in Fig.[T] As long as the SHTs are large enough, 
this results in very good scaling with the number of cores avail- 
able (see also Sect. |5. 2.4| >. 

If the target machine supports the SSE2 instruction se|^] 
(which is the case for all AMD/Intel CPUs introduced since 
around 2003), its ability to perform two arithmetic operations of 
the same kind with a single machine instruction is used to pro- 
cess two ring (pairs) simultaneously, greatly improving the over- 
all execution speed. The relevant loop is marked with "SSE2" in 
Fig. [T] The necessary changes are straightforward for the most 
part; however, special attention must be paid to the I recursion, 
because the two sequences of s /t/ m ()?) will cross the threshold 
to lEEE-representable numbers at different l±. Fortunately this 
complication can be addressed without a significant slowdown 
of the algorithm. 

The data types used and the functions called in order to use 
the SSE2 instruction set are not part of the C90 language stan- 
dard, and will therefore not be known to all compilers and on 
all hardware platforms. However there is a portable way to de- 
tect compiler and hardware support for SSE2; if the result of this 
check is negative, only the standard-compliant floating-point im- 

8 http://opertmp.orgJ 
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for b = <all submaps or "blocks"> 




for y = <all rings in submap b> 


// OpennP 


for j = <all map->a_lm jobs> 




compute G(m, theta_y , j ) using FFT 


// eq. 


end j 


end y 




for m = [0;mmax] 


// OpennP 


for 1 = [m;lmax] 




compute A(l,m) and B(l,m) 


// eq. 


end 1 


for y = <all rings in submap b> 


// SSE2 


for s = <all required spins> 




for 1 = [m;lmax] 




compute s_Ylm(theta_y) 


// eq. <| 1 3 p 


end 1 


end s 




for j = <all jobs> 




if (a_lm->map job) 




for 1 = [m;lmax] 




accumulate F(m,theta_y, j) 


// eq. i[8) 


end 1 




else /* map->a_lm job */ 




for 1 = [m;lmax] 




compute a_lm(j) 


// eq. i(9) 


end 1 




endif 




end j 




end y 




end m 




for y = <all rings in submap b> 


// OpenMP 


for j = <all a_lm->map jobs> 




compute map(x,y,j) using FFT 


// eq. 


end j 


end y 




end b 





Fig. 1. Schematic loop structure of libpsht's transform algo- 
rithm 



plementation of libpsht will be compiled, without the neces- 
sity of user intervention. 

Similarly, the OpenMP functionality is accessed using so- 
called #pragma compiler directives, which are simply ignored 
by compilers lacking the required capability, reducing the source 
code to a single-threaded version. 

Both extensions are supported by the widely used gccf^Jand 
Intel compilers. 



4. Programming interface 

Only a high-level overview of the provided functionality can be 
given here. For the level of detail necessary to actually interface 
with the library, the reader is referred to the technical documen- 
tation provided alongside the source code. 



4.1. Description of input/output data 

Due to libpsht's universality goal, it must be able to accept 
input data and produce output data in a wide variety of storage 
schemes, and therefore only makes few assumptions about how 
maps and ci\ m are arranged in memory. 



A tesselation of the sphere and the relative location of its 
pixels in memory is described to libpsht by the following set 
of data: 

- the total number of rings in the map; and 

- for every ring 

- the number of pixels N v y, 

- the index of the first pixel of the ring in the map array; 

- the stride between indices of consecutive pixels in this 
ring; 

- the azimuth of the first pixel <po $y ; 

- the colatitude 

- the weight factor w y . This is only used for forward SHTs 
and is typically the solid angle of a pixel in this ring. 

For ai m , l ess information is necessary: 

- the maximum / and m quantum numbers available in the set 

- the index stride between a/,,, and a/ + i jm ; 

- an integer array containing the (hypothetical) index of the 
element ao m . 



http://gcc.gnu.org 



Due to the convention described in Sect. 2.4 only coefficients 
with m > will be accessed. 

These two descriptions are flexible enough to describe all 
spherical grids of interest, as well as all data storage schemes 
that are in widespread use within the CMB community. In com- 
bination with a data pointer to the first element of a map or set 
of ai m , they describe all characteristics of a set of input or output 
data that libpsht needs to be aware of. 

It should be noted that the map description trivially supports 
transforms on grids that only cover parts of the # range; rings 
that should not be included in map synthesis or analysis can sim- 
ply be omitted in the arrays describing the grid geometry, speed- 
ing up the SHTs on this particular grid. More general masks, 
which affect parts of rings, have to be applied by explicitly set- 
ting the respective map pixels to zero before map analysis or 
after map synthesis. 

Libpsht will only work on data that is already laid out 
in main memory according to the structures mentioned above. 
There is no functionality to access data on mass storage directly; 
this task must be undertaken by the programs calling libpsht 
functionality. Given the bewildering variety of formats which are 
used for storing spherical maps and a/„, on disk - and with the 
prospect of these formats changing more or less rapidly in the 
future - I/O was not considered a core functionality of the li- 
brary. 

4.2. Supported floating-point formats 

In addition to the flexibility in memory ordering, libpsht can 
also work on data in single- or double-precision format, with the 
requirement that all inputs and outputs for any set of simulta- 
neous transforms must be of the same type. Internally, however, 
all computations are performed using double-precision numbers, 
so that no performance gain can be achieved by using single- 
precision input and output data. 



4.3. Multiple simultaneous SHTs 

As already mentioned, it is possible to compute several SHTs at 
once, independent of direction and spin, as long as their map 
and ai m storage schemes described above are identical. This 
is achieved by repeatedly adding SHT requests, complete with 
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pointers to their input and output arrays, to a "job list", and fi- 
nally calling the main SHT function, providing the job list, map 
information and a\ m information as arguments. 

4.4. Calling the library from other languages 

Due to the choice of C as libpsht's implementation language, 
the library can be interfaced without any complications from 
other codes written in C or C++. Using the library from lan- 
guages like Python and Java should also be straightforward, but 
requires some glue code. 

For Fortran the situation is unfortunately a bit more compli- 
cated, because a well-defined interface between Fortran and C 
has only been introduced with the Fortran2003 standard, which 
is not yet universally supported by compilers. Due to the phi- 
losophy of incrementally building a job list and finally execut- 
ing it (see Sect. 43}, simply wrapping the C functions using 
cfortran.rp"|g ue code does not work reliably, so that build- 
ing a portable interface is a nontrivial task. Nevertheless, the 
Fortran90 code beam2alm of the Planck mission's simulation 
package has been successfully interfaced with libpsht. 



5. Benchmarks 

In order to evaluate the reliability and efficiency of the library, a 
series of tests and comparisons with existing implementations 
were performed. All of these experiments were executed on 
a computer equipped with 64GB of main memory and Xeon 
E7450 processors (24 CPU cores overall) running at 2.4 GHz. 
gcc version 4.4.2 was used for compilation. 



5. 1 . Correctness and accuracy 

The validation of libpsht's algorithms was performed in two 
stages: first, the result of a backward transform was compared 
to that produced with another SHT library using the same input, 
and afterwards it was verified that a pair of backward and for- 
ward SHTs reproduces the original a\ m with sufficient precision. 

All calculations in this section start out with a set of s ay, that 
consists entirely of uniformly distributed random numbers in the 
range (-1; 1), except for the imaginary parts of the ,a/ o, which 
must be zero for symmetry reasons, and coefficients with I < \s\, 
which have no meaning in a spin-s transform and are set to zero 
as well. 



5.1 .1 . Validation against other implementations 

For this test, a set of a; m with l m . dx = 2048 was generated us- 
ing the above prescription and converted to a A^ s id e = 1024 
HEALPix map by both libpsht and the Fortran HEALPix fa- 
cility synfast. This was done for spins and 2, mimicking the 
synthesis of a polarised sky map. Comparison of the resulting 
maps showed a maximum discrepancy between individual pix- 
els of 1.55 x 10~ 7 , and the RMS error was 6.3 x 10~ 13 , which 
indicates a very good agreement. 



5.1 .2. Accuracy of recovered a lm 

An individual accuracy test consists of a map synthesis, followed 
by an analysis of the obtained map, producing the result s ci\ m . 



10" 

io- 1 

10-' 
IO" 1 



















V 

■ 














■-- m 








ECP — i — 
Gauss — x — 
HEALPix — *— 

HEALPix4 b 

HEALPix8 — ■-- 



512 1024 



Fig. 2. e rms for libpsht transform pairs on ECP, Gaussian and 
HEALPix grids for various Z max . Every data point is the maxi- 
mum error value encountered in transform pairs of all supported 
spins. "Healpix4" and "Healpix8" refer to iterative analyses with 
4 and 8 steps, respectively. 
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Fig. 3. e max for libpsht transform pairs on ECP, Gaussian and 
HEALPix grids for various Z max . Every data point is the maxi- 
mum error value encountered in transform pairs of all supported 
spins. "Healpix4" and "Healpix8" refer to iterative analyses with 
4 and 8 steps, respectively. 



Two quantities are used to assess the agreement between 
these two sets of spherical harmonic coefficients; they are de- 
fined as 



^rms • 



\s@lm s@lm\ 



\ Tilm \s a lm\ 

max(|5l(. s a,,„ - 

Im 



and 



iS/m)l. |3(jfl/m _ s&lm)\) ■ 



(18) 
(19) 



1 1 http : / /www- zeus . desy . de/~burow/cf ortran/ 



A variety of such tests was performed for different grid ge- 
ometries, Z max , and spins; the results are shown in Figs. |2| and 
® 

For ECP and "Gaussian" grids, each ring consisted of 2Z max + 
1 pixels, and the grids contained 2/ max + 2 and Z max + 1 rings, 
respectively. The § y for the Gaussian grid were chosen to coin- 
cide with the roots of the Legendre polynomial Pi m!a +i("ff)- Under 
these conditions, and with appropriately chosen pixel weights 
(taken from |Driscoll & Healy]|1994| and |Doroshkevich et aL 
2005 ), the transform pair should reproduce the original ai m ex- 



actly, were it not for the limited precision of floating-point arith- 
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Table 1. Single-core CPU time for various SHTs with l„ 
2048. 



Direction 


Spin 


ECP 


Gaussian 


HEALPix 


forward 





15.14s 


7.89 s 


13.11s 


backward 





14.74 s 


7.53 s 


12.65 s 


forward 


1 


29.49 s 


15.36 s 


25.05 s 


backward 


1 


27.83 s 


14.30 s 


23.85 s 


forward 


2 


31.91s 


16.91s 


28.15 s 


backward 


2 


30.33 s 


15.60 s 


26.51s 



Notes. The ECP grid had 2/ max + 1 pixels per ring and 2/ max + 2 rings, 
the Gaussian grid had 2/ raax + 1 pixels per ring and Z milx + 1 rings, and 
the iVjide parameter of the HEALPix grid was Z max /2. 




metics. In fact, the errors measured for SHTs on those two grids 
are extremely small and close to the theoretical limit. 

For the transform pairs on HEALPix grids, a resolution pa- 
rameter of A^ s id e = ^max/2 was chosen. Here, the recovered s 2/ m 
will only be an approximation of the original s ai m ; if higher accu- 
racy is required, this approximation can be improved by Jacobi 
iteration, as described in the HEALPix documentation. The fig- 
ures show clearly that s rms is around 10~ 3 for the simple trans- 
form pair and can be reduced dramatically by iterated analysis. 

5.2. Time consumption 
5.2.1 . Relative cost of SHTs 

Table [T] shows the single-core CPU times for SHTs of different 
directions, spins and grids, but with identical band limit, in order 
to illustrate their relative cost. It is fairly evident that forward and 
backward SHTs with otherwise identical parameters require al- 
most identical time, with the a; m — > map direction being slightly 
slower. Since this relation also holds for other band limits, only 
the timings for one direction will be shown in the following fig- 
ures, in order to reduce clutter. 

SHTs with |j| > take roughly twice the time of correspond- 
ing scalar ones; here the s = 2 case is slightly more expensive 
than s = 1, because computing the zFf from A/,,, requires more 
operations than computing the \F f . 

The timings also indicate the scaling of the SHT cost with 
N§, which is practically equal for HEALPix and ECP grids, but 
half as high for the Gaussian grid. 

These timings, as well as all other timing results for libpsht 
in this paper, were obtained using SSE2-accelerated code. 
Switching off this feature will result in significantly longer ex- 
ecution times; on the benchmark computer the factor lies in the 
range of 1.6 to 1.8. 



5.2.2. Scaling with / max 

SHTs of different spins and a wide range of band limits were per- 
formed on a single CPU core, in order to assess the correlation 
between CPU time and Z max ; the results are shown in Fig. [4] As 
expected, the time consumption scales almost with / max ; devia- 
tions from this behaviour only become apparent for lower band 
limits, where SHT components of lower overall complexity (like 
the FFTs, computation of the A\ m and B/ m coefficients and initial- 
isation of data structures) are no longer negligible. Switching to 
a normalised representation of the CPU time, as was done in 
Fig. [5] makes this effect much more evident; even for transforms 



Fig. 4. Timings of single-core forward SHTs on a HEALPix grid 
with A^ s ide = /max/2. The annotation "polarised" refers to a com- 
bined SHT of spin and 2 quantities, which is a very common 
case in CMB physics. 




Fig. 5. Same as Fig. [4] but with CPU time divided by l^ m N§ 



of very high band limit the curves are not entirely horizontal, 
which would indicate a pure <9(7 max A^) algorithm. 



5.2.3. Performance comparison with Fortran HEALPix 

Fig. [6] shows the ratio of CPU times for equivalent SHTs car- 
ried out using the synfast facility of Fortran HEALPix and 
libpsht, using identical hardware resources. It allows several 
deductions: 

- In all tested scenarios, libpsht's implementation is signif- 
icantly faster, even when precomputed coefficients are used 
with synfast. 

- The performance gap is markedly wider for s — trans- 
forms, which indicates that libpsht's /-recursion is the part 
which has been accelerated by the largest factor in compari- 
son to the Fortran HEALPix SHT library. 

- The polarised synfast SHTs only show a marginal speed 
improvement when using precomputed coefficients; for the 
scalar transforms, the difference is much more notice- 
able. Although this observation has no direct connection to 
libpsht, it is nevertheless of interest: most likely the miss- 
ing acceleration is caused by saturation of the memory band- 
width, as discussed in Sect. |3.1.3[ and confirms the decision 
not to introduce such a feature into libpsht. 
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■ - ^ spin=0 — i — 
polarised — x— 
spin=0, precomputed 
polarised, precomputed s 



Table 2. Speedup for simultaneous vs. consecutive SHTs on a 
HEALPix grid with N side = 1024 and l maK = 2048. 



1024 



2048 



4096 



Fig. 6. CPU time ratios between the Fortran HEALPix synf ast 
code and libpsht for various backward SHTs on a HEALPix 
grid with N side = Wx/2. 



backward, s=2 
ideal scaling 



number of threads 



Fig. 7. Elapsed wall clock time for a backward SHT with s — 2, 
^max = 4096, and N side = 2048 on a HEALPix grid, run with a 
varying number of OpenMP threads. 



5.2.4. Scaling with number of cores 

As illustrated in Fig. [T] all CPU-intensive parts of libpsht's 
transform algorithm are instrumented for parallel execution on 
multicore computers using OpenMP directives. Fig.|7]shows the 
wall clock times measured for a single SHT that was run using 
a varying number of OpenMP threads. The maximum number 
of threads was limited to 16 (in contrast to the maximum use- 
ful value of 24 on the benchmark computer), because the ma- 
chine was not completely idle during the libpsht performance 
measurements, and the scaling behaviour at larger numbers of 
threads would have been influenced by other running tasks. 

It is evident that the overhead due to workload distribution 
among several threads is quite small: for the 16-processor run, 
the accumulated wall clock time is only 18% higher than for the 
single-processor SHT. An analogous experiment was performed 
with the Fortran HEALPix code synf ast; here the overhead for 
16 threads was around 35%. 



5.2.5. Comparison with theoretical CPU performance 

For this experiment, the library's code was enhanced to count the 
number of floating-point operations that were carried out during 



Transforms 


-^simul/s 


-^separate/s 


Speedup 


laO 


12.65 


12.65 


1.00 


2a0 


18.06 


25.30 


1.40 


3a0 


23.47 


37.95 


1.62 


5a0 


34.77 


63.25 


1.82 


lOaO 


62.56 


126.50 


2.02 


3a0+3m0 


41.72 


77.46 


1.85 


Ia0+lal + la2+lm0+lml + lm2 


78.79 


129.44 


1.64 


4m2 


63.68 


112.40 


1.77 



Notes. The transforms are specified in an abbreviated fashion, where 
"5a2" signifies "5 «;„, — > map SHTs of spin 2", and "3m0" stands for 
"3 map — > a/,,, SHTs of spin 0". The runs were performed on a single 
CPU core. 



the SHT. This instrumentation was only done in the parts of the 
code with <9(Z^ ax A^) complexity, i.e. everything related to cre- 
ation and usage of the s Ai, n {ff). Operations that were necessary 
only for the purpose of numerical stability (e.g. due to the range- 
extended floating-point format) were not taken into account. 
Dividing the resulting number of operations by the CPU time 
for a single-core uninstrumented run of the same SHT provides a 
conservative estimate for the overall floating-point performance 
of the algorithm. Measuring s — and s — 2 ai, n — > map SHTs 
on a HEALPix grid with N side = 1024 and Z max = 2048 resulted 
in 2.1GFlops/s and 4.1GFlops/s, respectively. This corresponds 
to 0.88 and 1.71 floating-point operations per clock cycle, or 
22% and 43% of the theoretical peak performance. Both of these 
percentages are quite high for an implementation of a nontrivial 
algorithm, which can be seen as a hint that there is not much 
more room for further optimisation on this hardware architec- 
ture. In other words, further speedups can only be achieved by a 
fundamental change in the underlying algorithm. 

There are several reasons for the obvious performance dis- 
crepancy between scalar and s — 2 transforms. To a large part 
it is probably caused by the internal dependency chains of the 
X\ m recursion: even though the CPU could in principle start new 
operations every clock cycle, it has to wait for some preceding 
operations to finish first (which takes several cycles), because 
their results are needed in the next steps. When computing a re- 
currence, the CPU is therefore spending a significant fraction of 
time in a waiting state (see, e.g., |Fog |20 1 0| l . For the s — 2 trans- 
form this is much less of an issue, since many more floating- 
point operations must be carried out, which are not interdepen- 
dent, and the time spent on the /-recursion becomes subdomi- 
nant. 

Other reasons for the lower performance of the recursion are 
the necessity to deal with extended floating-point numbers and 
its rather high ratio of memory accesses compared to arithmetic 
operations. 

5.2.6. Gains by using simultaneous transforms 

The CPU time savings achievable by performing multiple SHTs 
together instead of one after another were measured on a 
HEALPix grid with JV side = 1024 and / max = 2048. Table [2] 
summarises the measured CPU times for an unsystematic selec- 
tion of simultaneous and successive transforms, as well as the 
speedup. 

The results show that for a large number of simultaneous 
SHTs the computation time can be cut roughly in half; this is 
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Imax 
Imax 3 
libpsht(s=0) 
synfast(s=0) 
libpsht(polarised) 
synfast(polarised) 
synfast(s=0,precompijted) 
synfast(polarised,precomputed) 



2048 

I™. 



4096 



Fig. 8. Memory consumption of various SHTs performed with 
libpsht and synfast on a HEALPix grid with A^ s id e = Wx/2. 
Input and output data are stored in single-precision format. 



the same asymptotic limit that also holds for SHTs with com- 
pletely precomputed s At m (ff) (see Sect. 3.1.3 I, but the presented 
approach has dramatically lower memory needs and can there- 
fore be used for higher-resolution transforms. It is also evident 
that forward and backward transforms, as well as transforms of 
different spins, can be freely mixed while still preserving the 
performance increase. 



capabilities of the CPUs, as well as shared memory parallelism 
are exploited, resulting in further performance gains. 

Despite this range of capabilities, the package only consists 
of approximately 7000 lines of code, including the FFT imple- 
mentation and in-line documentation for developers; except for 
a C compiler, it has no external dependencies. 

Libpsht has been integrated into the simulation package of 
the Planck mission (Level-S, Rein ecke et al.|2 006), where it is in- 
terfaced with a development version of the C++ HEALPix pack- 
ag^jand a Fortran90 code performing SHTs on detector beam 
patterns; this demonstrates the feasibility of interfacing the li- 
brary with C++ and Fortran code. 

There are two areas in which libpsht's functionality should 
probably be extended: it currently does not provide transforms 
for spins larger than 2, and there is no active support for dis- 
tributing SHTs over several independent computing nodes. As 
was mentioned in Sect. 3.1.2 addressing the first point is prob- 
ably not too difficult, and enhancing the library with s Y/ m gen- 
erators based on Wigner d matrix elements is planned for one 
of the next releases. Regarding the second point, a combina- 
tion of libpsht's central SHT functionality with the paralleli- 
sation strategy implemented by Radek Stompor's s2hat library 
appears very promising and is currently being investigated. 

Libpsht is distributed under the terms of the GNU General 
Public License (GPL) version 2 (or later versions at the user's 
choice); the most recent version, alongside online technical doc- 
umentation can be obtained at the URL http : / /sourcef orge . 
net/pro j ects/libpsht/ 



5.3. Memory consumption 

Fig. [8] shows the maximum amount of memory allocated during 
a variety of SHTs carried out using libpsht and the Fortran 
HEALPix synfast facility. Since the memory required for for- 
ward and backward transforms is essentially equal, only one di- 
rection has been plotted. The sizes of SHTs with Z max < 512 
could not be measured reliably due to their short running time 
and are therefore not shown. 

As can be seen, libpsht needs slightly less memory than 
synfast for equivalent operations; overall, the scaling is very 
close to the expected Z^ ax if no precomputed s /l/ m (#) are used. 
For comparison purposes, a few data points for synfast runs 
using precomputed scalar and tensor s Ai m (§) were also plotted; 
they clearly indicate the extremely high memory usage of these 
transforms, which scales with Z^ ax . 

During the tests with multiple simultaneous threads it was 
observed that increasing the number of cores does not have a 
significant influence on the total required memory; for the con- 
crete test discussed in Sect. |5.2.4] memory usage for the run with 
16 threads was only 2% larger than for the single-threaded run. 



6. Conclusions 



Looking back at the goals outlined in Sect. 2.2 and the results of 
the benchmark computations above, it can be concluded that the 
current version of the libpsht package meets almost all spec- 
ified requirements. It implements SHTs that have the same or 
a higher degree of accuracy as the other available implementa- 
tions, can work on all spherical grids relevant to CMB studies, 
and is written in standard C, which is very widely supported. The 
employed algorithms are significantly more efficient than those 
published and used in this field of research before; their memory 
usage is economic and allows transforms whose input and output 
data occupy almost all available space. When appropriate, vector 
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